3 UMAP

We’ll use the mathemagical Uniform Manifold Approximation and Projection (UMAP) algorithm to project the already dimension-reduced data (150 singular vectors) into 2-space. UMAP is a dimension reduction technique that combines nearest-neighbor graphs with ideas from topology. It is similar to t-SNE in its approach, but its foundations rest on firmer (and more complicated) mathematical theory (manifolds/topology).
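As a rough illustration of what the projection step involves, here is a minimal sketch on toy data. It assumes the CRAN `umap` package (suggested by the `$layout` field used in this section); `toy_vectors`, `toy_config`, and `toy_umap` are hypothetical names, and the random matrix merely stands in for `svd$v`. The settings shown are illustrative, not necessarily the ones used for the real data.

```r
library(umap)  # assumption: the CRAN 'umap' package, whose result has a $layout matrix

set.seed(1)
# Hypothetical stand-in for svd$v: 200 "documents" x 10 "singular vectors"
toy_vectors <- matrix(rnorm(200 * 10), nrow = 200, ncol = 10)

# Start from the package defaults and override a few settings
toy_config <- umap.defaults
toy_config$n_components <- 2    # project into 2-space
toy_config$n_neighbors <- 15    # neighborhood size used to build the neighbor graph
toy_config$random_state <- 42   # fix the seed so the layout is reproducible

toy_umap <- umap(toy_vectors, config = toy_config)

# One row per input row, one column per output dimension
dim(toy_umap$layout)
```

Fixing `random_state` matters in practice because UMAP is stochastic: without it, two runs on the same data can give visibly different layouts, which is one reason to cache the result as done in this section.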

# The embedding was computed once and saved; here we load the saved result.
# svd_ump = umap(svd$v)
# save(svd_ump, file='svd_ump.RData')
load('docs/final_data_plots/svd_ump.RData')

fig <- plot_ly(type = 'scatter', mode = 'markers')
fig <- fig %>%
  add_trace(
    x = svd_ump$layout[,1],
    y = svd_ump$layout[,2],
    # Hover text: heading and raw text of each document (stray "$" removed)
    text = ~paste('heading:', head, "<br>text: ", raw_text),
    hoverinfo = 'text',
    marker = list(color='green', opacity=0.6),
    showlegend = FALSE
  )

fig

A handful of outliers stretch the axes, so the main body of points can only be seen by zooming in. We will routinely omit these outliers when creating the plot (after noting that they form nice clusters of related documents), so that the main plot is readable without zooming.

# Keep only points whose two UMAP coordinates both lie within (-20, 20)
index_subset = abs(svd_ump$layout[,1]) < 20 & abs(svd_ump$layout[,2]) < 20
data_subset = svd_ump$layout[index_subset,]
raw_text_subset = raw_text[index_subset]
head_subset = head[index_subset]

fig <- plot_ly(type = 'scatter', mode = 'markers')
fig <- fig %>%
  add_trace(
    x = data_subset[,1],
    y = data_subset[,2],
    # Hover text: heading and raw text of each retained document (stray "$" removed)
    text = ~paste('heading:', head_subset, "<br>text: ", raw_text_subset),
    hoverinfo = 'text',
    marker = list(color='green'),
    showlegend = FALSE
  )

fig

After omitting the straggler points on the outskirts, the plot is much easier to read and appears to show reasonably good cluster separation.
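Since the cutoff of 20 is hard-coded, it can be worth checking how many points it actually drops before trusting the trimmed plot. The following is a small sketch on made-up coordinates; `toy_layout`, `cutoff`, `keep`, and `n_dropped` are hypothetical names, and the toy matrix only imitates the shape of `svd_ump$layout`.

```r
set.seed(1)
# Hypothetical layout: 1000 points near the origin plus 5 far-away stragglers
toy_layout <- rbind(
  matrix(rnorm(2000, sd = 3), ncol = 2),
  matrix(c(50, 55, -60, 45, 70,
           40, -45, 65, 50, -55), ncol = 2)
)

cutoff <- 20  # same threshold as in the filtering step above

# TRUE for points whose two coordinates both lie within (-cutoff, cutoff)
keep <- abs(toy_layout[, 1]) < cutoff & abs(toy_layout[, 2]) < cutoff

n_dropped <- sum(!keep)
cat("dropped", n_dropped, "of", nrow(toy_layout), "points\n")

# drop = FALSE keeps the result a matrix even if only one row survives
toy_subset <- toy_layout[keep, , drop = FALSE]
range(toy_subset)  # all retained coordinates fall inside the cutoff
```

If the number of dropped points turns out to be large, the cutoff is probably cutting into the main body of the embedding rather than only trimming stragglers, and a wider threshold may be more appropriate.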